Technical Reports on Mathematical and Computing Sciences: TR-C136 Title: Adaptive Sampling Methods for Scaling up Knowledge Discovery Algorithms (Extended Revised Version of TR-C131)

Authors

  • Carlos Domingo
  • Ricard Gavalda
  • Osamu Watanabe
Abstract

Scalability is a key requirement for any KDD and data mining algorithm, and one of the biggest research challenges is to develop methods that allow the use of large amounts of data. One possible approach for dealing with huge amounts of data is to take a random sample and do data mining on it, since for many data mining applications approximate answers are acceptable. However, as argued by several researchers, random sampling is difficult to use because it is hard to determine an appropriate sample size. In this paper, we take a sequential sampling approach to this difficulty and propose an adaptive sampling method that solves a general problem covering many actual problems arising in applications of discovery science. An algorithm following this method obtains examples sequentially in an on-line fashion, and it determines from the obtained examples whether it has already seen a large enough number of examples. Thus, the sample size is not fixed a priori; instead, it adaptively depends on the situation. Due to this adaptiveness, if we are not in a worst-case situation, as fortunately happens in many practical applications, then we can solve the problem with a number of examples much smaller than that required in the worst case. We prove the correctness of our method and estimate its efficiency theoretically. To illustrate its usefulness, we consider one concrete example of using sampling, provide an algorithm based on our method, and show its efficiency by experimental evaluation. (This is an extended revised version of TR-C131.)
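The idea of letting the data decide the sample size can be illustrated with a minimal sketch. It is not the algorithm of the report, whose details are not given in this abstract; it is a generic sequential sampler that estimates a probability p to within relative error eps, using a Hoeffding bound with a union bound over all checking times. The function name `adaptive_estimate`, its parameters, and the cap `max_examples` are assumptions made for illustration only.

```python
import math
import random


def adaptive_estimate(draw, eps=0.1, delta=0.05, max_examples=1_000_000):
    """Illustrative sketch (not the report's algorithm).

    draw() returns True/False (or 1/0) for one random example.
    Returns (p_hat, n): the estimate and the number of examples used.
    If the loop stops before the cap, then with probability at least
    1 - delta the estimate satisfies |p_hat - p| <= eps * p.
    """
    total = 0
    for t in range(1, max_examples + 1):
        total += draw()
        p_hat = total / t
        # Hoeffding radius at step t; the failure budget delta / (t (t + 1))
        # sums to delta over all t, so the bound holds at every step at once.
        alpha = math.sqrt(math.log(2 * t * (t + 1) / delta) / (2 * t))
        # Stop once the radius is small relative to a lower bound on p.
        if p_hat - alpha > 0 and alpha <= eps * (p_hat - alpha):
            return p_hat, t
    # Cap reached: return the current estimate without the guarantee.
    return total / max_examples, max_examples


if __name__ == "__main__":
    rng = random.Random(0)
    for p in (0.6, 0.2):
        est, n = adaptive_estimate(lambda: rng.random() < p)
        print(f"p={p}: estimate={est:.3f} after {n} examples")
```

In this sketch the stopping time depends on the unknown p: a larger p lets the loop stop after fewer examples, while a fixed worst-case sample size would have to be chosen for the smallest p of interest.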

Similar resources

Technical Reports on Mathematical and Computing Sciences: TR-CXXX Title: Algorithmic Aspects of Boosting

We discuss algorithmic aspects of boosting techniques, such as Majority Vote Boosting [Fre95], AdaBoost [FS97], and MadaBoost [DW00a]. Considering a situation where we are given a huge number of examples and asked to find some rule for explaining these example data, we show some reasonable algorithmic approaches for dealing with such a huge dataset by boosting techniques. Through this example, ...

Technical Reports on Mathematical and Computing Sciences: TR-C123 Title: Practical Algorithms for On-line Sampling

One of the core applications of machine learning to knowledge discovery consists of building a function (a hypothesis) from a given amount of data (for instance a decision tree or a neural network) such that we can use it afterwards to predict new instances of the data. In this paper, we focus on a particular situation where we assume that the hypothesis we want to use for prediction is very si...

Automatically transforming regularly structured linear documents into Hypertext

s from the year’s technical reports.7 A sample abstract may be seen in printed form as Figure 3. This technical report is issued jointly by UMIACS and by the Computer Science Department; hence it carries two identifying numbers. The abstract was described by the troff source of Figure 4, and the corresponding Hyperties article is shown in Figure 5. Here, the identifying numbers are used as the ...

Technical Reports on Mathematical and Computing Sciences: TR-C184

Based on a variation of the pigeonhole principle, we propose a propositional formula having a unique satisfying assignment and prove an exponential lower bound for computing each bit of its satisfying assignment by tree-like resolution. Our proof is obtained by following the outline proposed by Ben-Sasson and Wigderson [BW99].

Publication date: 1999